A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Let's start with a hello world program (matching characters):


In [ ]:
# module for regular expressions
import re
match = re.search(r'rld', 'hello world')
# re.search() returns None when nothing matches, so test before using the result
if match:
    print(match.group())
else:
    print("No matching pattern")

In the above example, re.search(pat, str) searches for the pattern pat in the string str. If the search is successful, a match object is returned; otherwise it returns None. Calling match.group() returns the matching text.
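
Because re.search() returns None on failure, calling group() on the result without checking it first raises an AttributeError. A minimal sketch of wrapping that check in a helper follows; the name first_match is hypothetical and is not part of the re module:

```python
import re

def first_match(pat, text):
    """Return the text of the first match of pat in text, or None."""
    match = re.search(pat, text)
    # re.search() returns None when nothing matches, so test before group()
    if match:
        return match.group()
    return None

found = first_match(r'rld', 'hello world')
missing = first_match(r'xyz', 'hello world')
print(found)
print(missing)
```

The helper only illustrates the None check; in short scripts the plain if/else form shown in the cell above works just as well.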

Basic Patterns

a, X, 9 -- ordinary characters just match themselves exactly. The meta-characters, which do not match themselves because they have special meanings, are: . ^ $ * + ? { } [ ] \ | ( )
. (a period) -- matches any single character except newline '\n'
\w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word.
\W -- (upper case W) matches any non-word character.
\b -- boundary between word and non-word
\s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f].
\S -- (upper case S) matches any non-whitespace character.
\t, \n, \r -- tab, newline, return
\d -- decimal digit [0-9]
^ = start, $ = end -- match the start or end of the string
\ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
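
The cells below cover ., \w and \d. As a complement, here is a small sketch of \b, the ^ and $ anchors, and backslash escaping; the sample strings are made up for illustration:

```python
import re

# \b marks a word boundary: 'cat' is found as a whole word, not inside 'concat'
boundary = re.search(r'\bcat\b', 'concat the cat')
print(boundary.start())

# ^ and $ anchor the pattern to the start and end of the string
anchored_ok = re.search(r'^\d\d\d$', '123')
anchored_bad = re.search(r'^\d\d\d$', 'x123')
print(anchored_ok.group())
print(anchored_bad)

# an escaped dot matches only a literal period; a bare dot matches any character
literal_dot = re.search(r'a\.b', 'axb a.b')
any_char = re.search(r'a.b', 'axb a.b')
print(literal_dot.group())
print(any_char.group())
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the re module sees them.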

Below are some examples:


In [ ]:
# . matches any character except '\n'
match = re.search(r'..o', 'hello pythonistas')
print(match.group())

In [ ]:
# \w matches any word character
match = re.search(r'\w\w', 'hello pythonistas')
print(match.group())

In [ ]:
# \d matches any digit character
match = re.search(r'\w\d\d', 'abc123')
print(match.group())

You can try out a few more examples not mentioned above.


In [ ]:

Repetitions

+ -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* -- 0 or more occurrences of the pattern to its left
? -- match 0 or 1 occurrences of the pattern to its left

Remember the rule "Leftmost and Largest": the search first finds the leftmost match for the pattern, and then tries to use up as much of the string as possible.

Below are some examples:


In [ ]:
match = re.search(r'l+o', 'hellllllllllllllllo world')
print(match.group())

In [ ]:
# Note that search() stops at the first (leftmost) match,
# so the later 'lllllo' is not returned
# Leftmost and Largest
match = re.search(r'l+o', 'hellllllllloaaaalllllo world')
print(match.group())

In [ ]:
# \s* = zero or more whitespace chars
# Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')
print('first match ' + match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx')
print('second match ' + match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx123xx')
print('third match ' + match.group())
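
The cells above exercise + and * but not ?. A minimal sketch of the ? quantifier follows, using made-up sample strings:

```python
import re

# 'u?' makes the letter u optional, so both spellings match
spellings = []
for word in ['color', 'colour']:
    match = re.search(r'colou?r', word)
    spellings.append(match.group())
print(spellings)

# '-?' allows an optional hyphen between two digits;
# the leftmost match wins, and ? takes the hyphen when it is there
hyphenated = re.search(r'\d-?\d', '1-2 34')
print(hyphenated.group())
```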

In [ ]:
# Let's try to find an email address in a particular string
match = re.search(r'\w+@\w+', 'foo blah blah foo@bar.com')
print(match.group())

In the above example the code returned foo@bar instead of foo@bar.com, because "." is not a word character, so \w+ stops in front of it.
Here comes the concept of square brackets:
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. So the next code:


In [ ]:
match = re.search(r'[\w.-]+@[\w.-]+', 'foo blah blah foo-foo@bar.com')
print(match.group())

findall
findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds all the matches and returns them as a list of strings, with each string representing one match.


In [ ]:
string = 'foo-bar@foo.com blah blah blah hello@world.com foo blah'
emails = re.findall(r'[\w.-]+@[\w.-]+', string)
for email in emails:
    print(email)

In [ ]:
# Suppose we want the username and host separately.
# For that we can use (): each match then becomes a tuple of the groups
emails = re.findall(r'([\w.-]+)@([\w.-]+)', string)
for email_tup in emails:
    print('username = ' + email_tup[0])
    print('host = ' + email_tup[1])
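
To make the return shapes concrete, here is a sketch that prints the raw findall() result and reads the same two groups from a single re.search() match. The group(1) and group(2) calls go slightly beyond what this section covers; the variable names are chosen only for illustration:

```python
import re

text = 'foo-bar@foo.com blah blah blah hello@world.com foo blah'

# With two groups in the pattern, findall() returns a list of 2-tuples
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(pairs)

# With search(), the same groups are read from the match object
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())    # the whole match
    print(match.group(1))   # first group: username
    print(match.group(2))   # second group: host
```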

In [ ]:
from IPython.core.display import Image
Image(filename='files/regular_expressions.png')
# The same features are also available in Python

This was an introductory-level look at regular expressions. More details can be found at: http://docs.python.org/2/library/re.html and http://docs.python.org/2/howto/regex.html